Paper: The phylogenetic range of bacterial and viral pathogens of vertebrates preprint: https://doi.org/10.1101/670315

Authors: Liam P. Shaw*, Alethea Wang*, David Dylus, Magda Meier, Grega Pogacnik, Christophe Dessimoz, Francois Balloux (* co-first authors)

Introduction

This is a supplementary R Markdown workbook (Supplementary Text 1) containing code to reproduce all main figures and analyses in the associated paper, as well as several supplementary files. Code can be shown or hidden overall (see top right) or for individual sections.

Code can be altered to change any of the figures. Some explanatory text is provided, but not much. Any questions can be addressed via email ().

Unless otherwise stated, data related to viruses is plotted in red and bacteria in black. The code should take <10 minutes to run on a standard laptop.

Note: At several points code is adapted/reused from the supplementary code repository made available by Olival et. al. (2017):

This mainly applies to several scripts in the `scripts’ directory relating to the fitting and plotting of generalized additive models (GAMs) – more information is included where appropriate. All code here is also made available under an MIT License.

Preliminary steps

This section contains library requirements and defines two useful functions:

First, the data is read in: the database of host-pathogen associations (tidy format i.e. one association per row); and the host phylogenetic tree based on mitochondrial gene alignments.

# Host-pathogen association database
pathogen_vs_host_db <- read.csv('data/PathogenVsHostDB-2019-05-30.csv', stringsAsFactors = F)
# Host phylogenetic tree
host_tree <- read.tree('data/HostPhylogeneticTree-2019-05-30.nwk')
# Uncomment to use only species which do not disrupt monophyly (n=40)
#host_tree <- read.tree('HostPhylogeneticTree-2019-05-30-pruned.nwk')

# Cophenetic matrix (from phylogenetic tree)
cophenetic_matrix <- cophenetic.phylo(host_tree) 

# Keep only overlap 
host_tree <- drop.tip(host_tree, host_tree$tip.label[which(!host_tree$tip.label %in% pathogen_vs_host_db$HostSpeciesPHB)])
pathogen_vs_host_db <- pathogen_vs_host_db[which(pathogen_vs_host_db$HostSpeciesPHB %in% host_tree$tip.label),]


# Subset database into virus and bacteria
virus <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Virus"),]
bacteria <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Bacteria"),]

Initial statistics

To begin with, some overall statistics which are cited in the main text.

# List of host orders
host_orders <- unique(as.character(pathogen_vs_host_db$HostOrder))
n.host.orders <- length(host_orders)
# Calculate THR for each host order
order_thr <- sapply(host_orders, function(x) THR(x, FUN=max))
median.inter.host.distance.order <- median(order_thr)

The number of host orders is 90.

The medium maximum inter-host distance within an order is 0.323.

Vector-borne

We find that viruses are significantly more likely to be vector-borne compared with bacteria.

vector.tab <- table(pathogen_vs_host_db[which(!duplicated(pathogen_vs_host_db$Species)),"VectorBorne"], pathogen_vs_host_db[which(!duplicated(pathogen_vs_host_db$Species)),"Type"])
vector.tab <- vector.tab[c("No", "Yes"),]
chisq.test(vector.tab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  vector.tab
## X-squared = 91.529, df = 1, p-value < 2.2e-16

Removing humans from the database

We want to think about pathogen range etc. after removing them from the database and the host tree. We recalculate PHB after doing this.

# Removing humans from database
pathogen_vs_host_db_no_humans <- pathogen_vs_host_db[which(pathogen_vs_host_db$HostSpeciesPHB!="Homosapiens"),]
t.no.humans <- drop.tip(host_tree, tip = host_tree$tip.label[!host_tree$tip.label %in% pathogen_vs_host_db_no_humans$HostSpeciesPHB])
d.no.humans <- cophenetic.phylo(t.no.humans)
# Bacteria PHB
bacteria.names <- names(sort(table(bacteria$Species), decreasing=TRUE))
bacteria.no.humans.phbs <- as.numeric(sapply(bacteria.names, function(x) PHB(x, 
                                                                             m=pathogen_vs_host_db_no_humans, 
                                                                             cophenetic.matrix = d.no.humans)))
virus.names <- names(sort(table(virus$Species), decreasing=TRUE))
virus.no.humans.phbs <- as.numeric(sapply(virus.names, function(x) PHB(x,                                                                 m=pathogen_vs_host_db_no_humans, 
                                                                       cophenetic.matrix = d.no.humans)))
virus.no.humans.phbs.range <- as.numeric(sapply(virus.names, function(x) PHB(x,                                                                 m=pathogen_vs_host_db_no_humans, 
                                                                             cophenetic.matrix = d.no.humans, FUN = max)))

Summary statistics

It is useful to have data frames with the summary statistics for each pathogen species (i.e. one row per pathogen), calculated from the pathogen-host association database (one row per association). We separate this into viruses and bacteria, as we treat them separately.

Viruses

options(warn=-1) # Turn off warnings
virus.names <- names(sort(table(virus$Species), decreasing=TRUE))
n.hosts <- as.numeric(sort(table(virus$Species), decreasing=TRUE))
virus.phbs <- sapply(virus.names, function(x) PHB(x))
virus.unique <- virus[which(!duplicated(virus$Species)),]
rownames(virus.unique) <- virus.unique$Species
proteins <- as.numeric(virus.unique[virus.names, "ProteinCount"])

virus.families <- virus.unique[virus.names, "Family"]

# Combine into a data frame of statistics for each viral pathogen
virus.df <- data.frame(cbind(virus.phbs, proteins))

genome.size <- as.numeric(virus.unique[virus.names, "Gsize"])
virus.df$genome.size <- genome.size
virus.df$family <- virus.families
#
n.hosts <- as.numeric(sort(table(virus$Species), decreasing=TRUE))

virus.df$n.hosts <- n.hosts

genome.type <- virus.unique[virus.names, "Genome"]
virus.df$genome.type <- genome.type # Genome type
# Genome type (just RNA/DNA)
genome.type.rna.dna <- ifelse(genome.type %in% c("(-) ssRNA", "(+) ssRNA",
                                                 "ssRNA-RT", "dsRNA"), "RNA",
                              ifelse(genome.type %in% c("dsDNA", "dsDNA-RT", "ssDNA"), 
                                     "DNA", ""))
virus.df$genome.type.rna.dna <- genome.type.rna.dna 

zoonotic <-  ifelse(virus.unique[virus.names, "Zoonotic"]=="Yes", "Zoonotic", "Not zoonotic")
virus.df$n.hosts <- n.hosts # Number of hosts
virus.df$zoonotic <- zoonotic # Zoonotic
virus.df$virus.phbs.no.human <- virus.no.humans.phbs # Add non-human mean PHB
virus.df$virus.phbs.max.no.human <- virus.no.humans.phbs.range # Add non-human maximum PHB
virus.df$vector.borne <- virus.unique[virus.names, "VectorBorne"]

Bacteria

# Get bacterial pathogen species names
bacteria.names <- names(sort(table(bacteria$Species), decreasing=TRUE))
# Get number of hosts for each pathogen
n.hosts <- as.numeric(sort(table(bacteria$Species), decreasing=TRUE))
# Get mean PHB for each pathogen
bacteria.phbs <- as.numeric(sapply(bacteria.names, function(x) PHB(x)))
# Dataframe of unique bacterial pathogen species
bacteria.unique <- bacteria[which(!duplicated(bacteria$Species)),]
rownames(bacteria.unique) <- bacteria.unique$Species

# Extract relevant lifestyle variables:
# motility, cell, Gram stain, 
# spore formation, oxygen requirements, zoonotic
motility <- bacteria.unique[bacteria.names, "Motility"]
cell <- bacteria.unique[bacteria.names, "Cell"]
gram <- bacteria.unique[bacteria.names, "GramStain"]
spore <- bacteria.unique[bacteria.names, "Spore"]
oxygen <-  bacteria.unique[bacteria.names, "Oxygen"]
zoonotic <-  ifelse(bacteria.unique[bacteria.names, "Zoonotic"]=="Yes", "Zoonotic", "Not zoonotic")

# Combine this information into a single data frame of statistics for each bacterial pathogen
bacteria.df <- data.frame(cbind(bacteria.names, n.hosts, bacteria.phbs, 
                                motility, gram, spore, oxygen, zoonotic, cell))
bacteria.df$bacteria.phbs <- as.numeric(as.character(bacteria.df$bacteria.phbs)) # Make sure numeric variable
bacteria.df$bacteria.phbs.no.human <- bacteria.no.humans.phbs # Add in PHB without humans
bacteria.df$family <- bacteria.unique[bacteria.names, "Family"] # Add taxonomic family
bacteria.df$genus <- bacteria.unique[bacteria.names, "Genus"] # Add taxonomic genus
bacteria.df$vector.borne <- bacteria.unique[bacteria.names, "VectorBorne"]

Figures

Figure 1

This is the schematic overview figure, and was made manually in Inkscape using images from FlatIcon (see references in main manuscript).

Figure 2

We get a dataset of just the unique pathogens and then calculate the mean phylogenetic host breadth (PHB) of each pathogen. We then use this to plot a histogram.

options(warn=-1) # Turn off warnings
# Get unique pathogens
unique_pathogens <- data.frame(Species=pathogen_vs_host_db[!duplicated(pathogen_vs_host_db$Species),"Species"],
                          Type=pathogen_vs_host_db[!duplicated(pathogen_vs_host_db$Species),"Type"])
unique_pathogens$Species <- as.character(unique_pathogens$Species)
# Calculate PHB
unique_pathogens$PHB <- sapply(unique_pathogens$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_matrix, m = pathogen_vs_host_db))
unique_pathogens$PHB.median <- sapply(unique_pathogens$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_matrix, m = pathogen_vs_host_db, FUN=median))
unique_pathogens$PHB.max <- sapply(unique_pathogens$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_matrix, m = pathogen_vs_host_db, FUN = max))


# Number of total hosts
host_species_counts <- data.frame(pathogen_vs_host_db %>% group_by(Species) %>% summarise(count=length(HostSpeciesPHB)))
rownames(host_species_counts) <- host_species_counts$Species
unique_pathogens$N.hosts <- host_species_counts[unique_pathogens$Species,"count"]

# Gather data
pathogen_range_histo <- data.frame(  phb = 1:50/30-0.03, 
                                     bacteria = hist(filter(unique_pathogens, Type == "Bacteria")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count,
                                     virus = hist(filter(unique_pathogens, Type == "Virus")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count) %>% gather(type, count, -phb)
pathogen_range_histo$count.norm <- pathogen_range_histo$count/sum(pathogen_range_histo$count)
pathogen_range_histo$type <- ifelse(pathogen_range_histo$type=="bacteria", "Bacteria", "Virus")

pathogen_range_histo.total <- data.frame(pathogen_range_histo %>% group_by(phb) %>% summarise(count=sum(count)))
# Pathogen range histogram, excluding zeros
pathogen_range_histo$type <- ordered(pathogen_range_histo$type, levels=c("Virus", "Bacteria"))
levels(pathogen_range_histo$type) <- c("(a) Viruses", "(b) Bacteria")
p.histogram.figure.2 <- ggplot(pathogen_range_histo, aes(fill=type, x=phb, y=count))+
  geom_histogram(colour="black", aes(fill="All"), data=pathogen_range_histo.total, stat="identity", alpha=0.4)+
  ylim(c(0,75))+geom_bar(stat="identity")+
  facet_wrap(~type, ncol=1)+
  scale_fill_manual(values=c("#de2d26", "#252525", "#f7f7f7"))+
  xlim(c(0,1))+theme_basic()+
  xlab("Mean PHB")+
  ylab("Frequency")+
  labs(fill="Pathogen type")+
  geom_vline(xintercept = median(filter(unique_pathogens, Type == "Bacteria")$PHB[which(filter(unique_pathogens, Type == "Bacteria")$PHB!=0)]), 
             colour="white", size=2, alpha=0.8)+
    geom_vline(xintercept = median(filter(unique_pathogens, Type == "Bacteria")$PHB[which(filter(unique_pathogens, Type == "Bacteria")$PHB!=0)]), 
             colour="black", size=1, linetype="dashed")+ # Second to make colour clear
    geom_vline(xintercept = median(filter(unique_pathogens, Type == "Virus")$PHB[which(filter(unique_pathogens, Type == "Virus")$PHB!=0)]), 
             colour="white", size=2, alpha=0.8)+
    geom_vline(xintercept = median(filter(unique_pathogens, Type == "Virus")$PHB[which(filter(unique_pathogens, Type == "Virus")$PHB!=0)]), 
             colour="red", size=1, linetype="dashed")+ # Second to make colour clear
  theme(axis.text=element_text(size=16),
        legend.text =element_text(colour="black", size=16),
        legend.title=element_text(colour="black", size=18),
        strip.text.x=element_text(colour="black", size=22),
        axis.title=element_text(colour="black", size=22))+
  theme(legend.position="none")
# Perform Wilcox test (for the caption of the figure)
wilcox.test(unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Bacteria"),"PHB"], unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Virus"),"PHB"])
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  unique_pathogens[which(unique_pathogens$PHB > 0 & unique_pathogens$Type ==  and unique_pathogens[which(unique_pathogens$PHB > 0 & unique_pathogens$Type ==     "Bacteria"), "PHB"] and     "Virus"), "PHB"]
## W = 105382, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
median(unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Bacteria"),"PHB"])
## [1] 0.5199371
median(unique_pathogens[which(unique_pathogens$PHB>0 & unique_pathogens$Type=="Virus"),"PHB"])
## [1] 0.4091278
# Show plot
p.histogram.figure.2

# Save figure
pdf("figures/Figure-2-histogram-PHB.pdf", width=12, height=8)
p.histogram.figure.2
dev.off()
## quartz_off_screen 
##                 2
png("figures/Figure-2-histogram-PHB.png", width=1200, height=800)
p.histogram.figure.2
dev.off()
## quartz_off_screen 
##                 2
options(warn=-1) # Turn off warnings
# Function to plot PHB histogram for a given dataset
plotHistogram <- function(association_dataset, cophenetic_distances){
  # Get unique pathogens
  unique_pathogens.subsampled <- data.frame(Species=association_dataset[!duplicated(association_dataset$Species),"Species"],
                                            Type=association_dataset[!duplicated(association_dataset$Species),"Type"])
  unique_pathogens.subsampled$Species <- as.character(unique_pathogens.subsampled$Species)
  # Calculate PHB
  unique_pathogens.subsampled$PHB <- sapply(unique_pathogens.subsampled$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_distances, m = association_dataset))
  unique_pathogens.subsampled$PHB.median <- sapply(unique_pathogens.subsampled$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_distances, m = association_dataset, FUN=median))
  unique_pathogens.subsampled$PHB.max <- sapply(unique_pathogens.subsampled$Species, function(x) PHB(x, cophenetic.matrix = cophenetic_distances, m = association_dataset, FUN = max))
  
  
  # Number of total hosts
  host_species_counts <- data.frame(association_dataset %>% group_by(Species) %>% summarise(count=length(HostSpeciesPHB)))
  rownames(host_species_counts) <- host_species_counts$Species
  unique_pathogens.subsampled$N.hosts <- host_species_counts[unique_pathogens.subsampled$Species,"count"]
  
  # Gather data
  pathogen_range_histo <-  data.frame(phb = 1:50/30-0.03, 
                                       bacteria = hist(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count,
                                       virus = hist(filter(unique_pathogens.subsampled, Type == "Virus")$PHB, breaks = 0:50/30, freq = TRUE, plot = FALSE)$count) %>% pivot_longer(cols=c("bacteria", "virus"), names_to = "type", values_to = "count")
  pathogen_range_histo$count.norm <- pathogen_range_histo$count/sum(pathogen_range_histo$count)
  pathogen_range_histo$type <- ifelse(pathogen_range_histo$type=="bacteria", "Bacteria", "Virus")
  
  pathogen_range_histo.total <- data.frame(pathogen_range_histo %>% group_by(phb) %>% summarise(count=sum(count)))
  # Pathogen range histogram, excluding zeros
  pathogen_range_histo$type <- ordered(pathogen_range_histo$type, levels=c("Virus", "Bacteria"))
  levels(pathogen_range_histo$type) <- c("(a) Viruses", "(b) Bacteria")
  p.histogram <- ggplot(pathogen_range_histo, aes(fill=type, x=phb, y=count))+
    geom_histogram(colour="black", aes(fill="All"), data=pathogen_range_histo.total, stat="identity", alpha=0.4)+
    ylim(c(0,75))+geom_bar(stat="identity")+
    facet_wrap(~type, ncol=1)+
    scale_fill_manual(values=c("#de2d26", "#252525", "#f7f7f7"))+
    xlim(c(0,1))+theme_basic()+
    xlab("Mean PHB")+
    ylab("Frequency")+
    labs(fill="Pathogen type")+
    geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB[which(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB!=0)]), 
               colour="white", size=2, alpha=0.8)+
    geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB[which(filter(unique_pathogens.subsampled, Type == "Bacteria")$PHB!=0)]), 
               colour="black", size=1, linetype="dashed")+ # Second to make colour clear
    geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Virus")$PHB[which(filter(unique_pathogens.subsampled, Type == "Virus")$PHB!=0)]), 
               colour="white", size=2, alpha=0.8)+
    geom_vline(xintercept = median(filter(unique_pathogens.subsampled, Type == "Virus")$PHB[which(filter(unique_pathogens.subsampled, Type == "Virus")$PHB!=0)]), 
               colour="red", size=1, linetype="dashed")+ # Second to make colour clear
    theme(axis.text=element_text(size=16),
          legend.text =element_text(colour="black", size=16),
          legend.title=element_text(colour="black", size=18),
          strip.text.x=element_text(colour="black", size=22),
          axis.title=element_text(colour="black", size=22))+
    theme(legend.position="none")
  return(p.histogram)
}

Subsampling to non-human hosts

options(warn=-1) # Turn off warnings
pathogen_vs_host_db_domestic <- pathogen_vs_host_db[which(pathogen_vs_host_db$Domestic=="Yes" & pathogen_vs_host_db$HostSpeciesPHB!="Homosapiens"),]
cophenetic_domestic <- cophenetic_matrix[which(rownames(cophenetic_matrix) %in% pathogen_vs_host_db_domestic$HostSpeciesPHB), which(colnames(cophenetic_matrix) %in% pathogen_vs_host_db_domestic$HostSpeciesPHB)]
p.PHB.domestic <- plotHistogram(pathogen_vs_host_db_domestic, cophenetic_domestic)
options(warn=-1) # Turn off warnings
pathogen_vs_host_db_non_domestic <- pathogen_vs_host_db[which(pathogen_vs_host_db$Domestic=="No" & pathogen_vs_host_db$HostSpeciesPHB!="Homosapiens"),]
cophenetic_non_domestic <- cophenetic_matrix[which(rownames(cophenetic_matrix) %in% pathogen_vs_host_db_non_domestic$HostSpeciesPHB), which(colnames(cophenetic_matrix) %in% pathogen_vs_host_db_non_domestic$HostSpeciesPHB)]
p.PHB.nondomestic <- plotHistogram(pathogen_vs_host_db_non_domestic, cophenetic_non_domestic)
cowplot::plot_grid(p.PHB.domestic+ggtitle("Domestic (n=2,270)"), p.PHB.nondomestic + ggtitle("Non-domestic (n=5,898)"))

Figure 3

We want to produce an analagous version of Figure 1 from Olival et al. (2017) with our data.

options(warn=-1) # Turn off warnings

# First make a dataset per order of mammals 
# Note that we have Artiodactyla where Olival et al. have "Cetartiodactyla" - have changed this
mammal.orders <- stringi::stri_trans_totitle(c( "CINGULATA", "PILOSA","DIDELPHIMORPHIA", "EULIPOTYPHLA",
                "CHIROPTERA", "PRIMATES", "RODENTIA", "CARNIVORA", "LAGOMORPHA", "PROBOSCIDEA", "DIPROTODONTIA",
                 "ARTIODACTYLA", "PERISSODACTYLA",  "PERAMELEMORPHIA", "SCANDENTIA"))
pathogen_vs_mammal_db <- pathogen_vs_host_db_no_humans[which(pathogen_vs_host_db_no_humans$HostOrder %in% mammal.orders),] 
human_pathogens <- pathogen_vs_host_db$Species[which(pathogen_vs_host_db$HostSpeciesPHB=="Homosapiens")]
pathogen_vs_mammal_db$pathogenSharedWithHumans <- ifelse(pathogen_vs_mammal_db$Species %in% human_pathogens,
                                                         "yes", "no")
boxplot.df <- data.frame(pathogen_vs_mammal_db %>% 
                           group_by(HostSpeciesPHB, HostOrder, Type) %>%
                           summarise(nPathogens=length(Species),
                                     nHumanPathogens=sum(pathogenSharedWithHumans == "yes"),
                                     nWildMammalHosts=sum(Domestic == "No")))
boxplot.df$propHumanPathogens <- boxplot.df$nHumanPathogens/boxplot.df$nPathogens
boxplot.df$propWildMammalHosts <- boxplot.df$nWildMammalHosts/boxplot.df$nPathogens

# Add information on domesticity: is host species domestic or not?
# Need to check this as they can be different for each association 
# i.e. is a species completely wild, completely domestic, or a mixture?
host.df <- pathogen_vs_host_db[which(!duplicated(pathogen_vs_host_db$HostSpeciesPHB)),]
host.df.domestic <- host.df$Domestic
names(host.df.domestic) <- host.df$HostSpeciesPHB
domestic.mammals <- names(host.df.domestic)[which(host.df.domestic=="Yes")]
wild.mammals <- names(host.df.domestic)[which(host.df.domestic=="No")]
captive.mammals <- names(host.df.domestic)[which(host.df.domestic=="Captive")]


# Order hosts using Olival et al. ordering
boxplot.df$HostOrder <- ordered(boxplot.df$HostOrder, levels=mammal.orders)

# Proportion of zoonotic pathogens for viruses
p.virus.zoonotic <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Virus"),], aes( HostOrder, propHumanPathogens))+
  geom_boxplot(outlier.shape = NA, fill="#fee5d9")+
  theme_basic()+coord_flip()+
  geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
  scale_fill_continuous(low="white", high="red")+
  scale_x_discrete(drop=FALSE)+
  theme(legend.position = "none")+
  ylab("")+xlab("")+
  ylab("Proportion of zoonotic viruses")+
  ggtitle("(a)")+
  theme(plot.title=element_text(hjust=0))

p.virus.richness <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Virus"),], aes( HostOrder, nPathogens))+
  geom_boxplot(outlier.shape = NA, fill="#fee5d9")+
  theme_basic()+coord_flip()+
  geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
  scale_fill_continuous(low="white", high="red", limits=c(0,1))+
  ylab("")+xlab("")+
  theme(legend.position = "none")+
  theme(axis.text.y=element_blank())+
  ylab("Total viral richness")+
    ggtitle("(b)")+
  theme(plot.title=element_text(hjust=0))+
  ylim(c(0,60))+
    labs(fill="Proportion of wild hosts")+
  theme(legend.position = c(0.7,0.15),
        legend.title = element_text(size=14), 
        legend.background = element_rect(colour="black"))

p.bacteria.zoonotic <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Bacteria"),], aes( HostOrder, propHumanPathogens))+
  geom_boxplot(outlier.shape = NA, fill="#f7f7f7")+
  theme_basic()+coord_flip()+
  geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
  scale_x_discrete(drop=FALSE)+
  scale_fill_continuous(low="white", high="black")+
  ylab("")+xlab("")+
  theme(legend.position = "none")+
  ylab("Proportion of zoonotic bacteria")+
    ggtitle("(c)")+
  theme(plot.title=element_text(hjust=0))

p.bacteria.richness <- ggplot(data=boxplot.df[which(boxplot.df$Type=="Bacteria"),], aes( HostOrder, nPathogens))+
  geom_boxplot(outlier.shape = NA, fill="#f7f7f7")+
  theme_basic()+coord_flip()+
  geom_jitter(aes(fill=propWildMammalHosts), colour="black", shape=21, width=0.15, height = 0, size=2)+
  scale_fill_continuous(low="white", high="black", limits=c(0,1))+
  ylab("")+xlab("")+
  scale_x_discrete(drop=FALSE)+
  theme(legend.position = "none")+
  theme(axis.text.y=element_blank())+
  ylab("Total bacterial richness")+
    ggtitle("(d)")+
  theme(plot.title=element_text(hjust=0))+
  ylim(c(0,60))+
  labs(fill="Proportion of wild hosts")+
  theme(legend.position = c(0.7,0.15),
        legend.title = element_text(size=14), 
        legend.background = element_rect(colour="black"))

pdf(file="figures/Figure-3-richness-per-species.pdf", width=20, height=9)
cowplot::plot_grid(p.virus.zoonotic, p.virus.richness,
                   p.bacteria.zoonotic, p.bacteria.richness, nrow=1)
dev.off()
## quartz_off_screen 
##                 2
png(file="figures/Figure-3-richness-per-species.png", width=2000, height=900)
cowplot::plot_grid(p.virus.zoonotic, p.virus.richness,
                   p.bacteria.zoonotic, p.bacteria.richness, nrow=1)
dev.off()
## quartz_off_screen 
##                 2
# Show plot
cowplot::plot_grid(p.virus.zoonotic, p.virus.richness,
                   p.bacteria.zoonotic, p.bacteria.richness, nrow=1)

# Correlation of bacterial and viral richness per host order
boxplot.df.bacteria <- boxplot.df[which(boxplot.df$Type=="Bacteria"),]
boxplot.df.virus <- boxplot.df[which(boxplot.df$Type=="Virus"),]
# Shared hosts i.e. have both bacteria and virus associations
shared.hosts <- boxplot.df.virus$HostSpeciesPHB[which(boxplot.df.virus$HostSpeciesPHB %in% boxplot.df.bacteria$HostSpeciesPHB)]
boxplot.df.shared <- boxplot.df[which(boxplot.df$HostSpeciesPHB %in% shared.hosts),]
# Now they are sorted, so can easily separate
mammal.bacterial.viral.richness <- data.frame(species=shared.hosts,
                                              nBacteria=boxplot.df.shared$nPathogens[which(boxplot.df.shared$Type=="Bacteria")],
                                              propBacteriaSharedHumans=boxplot.df.shared$propHumanPathogens[which(boxplot.df.shared$Type=="Bacteria")],
                                              nVirus=boxplot.df.shared$nPathogens[which(boxplot.df.shared$Type=="Virus")],
                                              propVirusSharedHumans=boxplot.df.shared$propHumanPathogens[which(boxplot.df.shared$Type=="Virus")])
cor.test(mammal.bacterial.viral.richness$nBacteria, mammal.bacterial.viral.richness$nVirus)
## 
##  Pearson's product-moment correlation
## 
## data:  mammal.bacterial.viral.richness$nBacteria and mammal.bacterial.viral.richness$nVirus
## t = 35.68, df = 357, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8587839 0.9045045
## sample estimates:
##       cor 
## 0.8837353

Figure 4

First we do GAMs for host traits which predict viral and bacterial richness.

options(warn=-1) # Turn off warnings
source('scripts/04-fit-GAMs-host-traits.R')
source('scripts/05-plot-GAMs-host-traits.R')
allplots

Figure 5

Then GAMs for predicting zoonotic potential.

detachAllPackages <- function() {

  basic.packages <- c("package:stats","package:graphics","package:grDevices","package:utils","package:datasets","package:methods","package:base")

  package.list <- search()[ifelse(unlist(gregexpr("package:",search()))==1,TRUE,FALSE)]

  package.list <- setdiff(package.list,basic.packages)

  if (length(package.list)>0)  for (package in package.list) detach(package, character.only=TRUE)

}
detachAllPackages()
options(warn=-1)
library(stringi)
library(parallel)
library(magrittr)
library(purrr)
library(mgcv)
library(ggplot2)
library(cowplot)
library(viridis)
library(svglite)
library(phangorn)
library(MuMIn)
library(dplyr)

source('scripts/06-fit-GAM-viral-zoonotic-potential.R')
source('scripts/07-fit-GAM-bacterial-zoonotic-potential.R')
allplots = cowplot::plot_grid(allplots_viral, allplots_bacterial, nrow=2, rel_widths = c(5.3, 5.3, 5.3))

# Save plots
png(file="figures/Figure-5-gams-zoonotic-potential.png", width = convertr::convert(183, "mm", "in")*500, convertr::convert(100, "mm", "in")*600, pointsize=7, res=600)
allplots
dev.off()
## quartz_off_screen 
##                 2
pdf(file="figures/Figure-5-gams-zoonotic-potential.pdf", width =7, height=5, pointsize=7)
allplots
dev.off()
## quartz_off_screen 
##                 2
# Show plot
allplots

Figure 6

We make a plot of the host switching for bacteria and viruses together (left panel of figure). We restrict

hostShiftingResultsOrder <- function(type="Bacteria", min.rep=0){
   metadata.type <- metadata[which(metadata$Type %in% type),]
  # Get host orders, and keep only those with more than min.rep representatives
  type.host.orders.t <- table(metadata.type$HostOrder)
  type.host.orders <- names(type.host.orders.t[type.host.orders.t>min.rep])
  # Prepare results to fill
  results <- matrix(nrow=0, ncol=8)
  # For each host order
  for (reference.order in type.host.orders){
    reference.order.pathogens <- unique(metadata.type[which(metadata.type$HostOrder==reference.order), "Species"])
    reference.order.species <- unique(as.character(metadata.type[which(metadata.type$HostOrder==reference.order), "HostSpeciesPHB"]))
    # For each host order, return percentage of pathogens shared with host order vs. mean distance 
    for (comparison.order in type.host.orders){
      #  All pathogens in that host order
      pathogens <- metadata.type[which(metadata.type$HostOrder==comparison.order),"Species"]
      # Percentage of those pathogens that are shared with reference order
      perc.shared <- length(pathogens[pathogens %in% reference.order.pathogens])/length(pathogens)
      perc.shared.unique <- length(unique(pathogens)[unique(pathogens) %in% reference.order.pathogens])/length(unique(c(pathogens, reference.order.pathogens)))
      # Mean of all host order in order from reference order
      comparison.order.species <- unique(as.character(metadata.type[which(metadata.type$HostOrder==comparison.order),"HostSpeciesPHB"]))
      mean.distance <- mean(d[comparison.order.species, reference.order.species])
      sd.distance <- sd(d[comparison.order.species, reference.order.species])
      # Add to results 
      if (reference.order==comparison.order){
        mean.distance <- NA
        sd.distance <- NA
        perc.shared <- NA
        perc.shared.unique <- NA
      }
      results <- rbind(results, c(reference.order,
                                  comparison.order,
                                  mean.distance, 
                                  sd.distance, 
                                  perc.shared, 
                                  perc.shared.unique, 
                                  length(reference.order.species), 
                                  length(comparison.order.species) ))
    }
  }
  
  # Naming of data frame, add names of host orders
  results <- as.data.frame(results, stringsAsFactors=FALSE)
  colnames(results) <- c("order.1",
                         "order.2",
                         "mean.distance", 
                         "sd.distance", 
                         "perc.shared", 
                         "perc.shared.unique",
                         "n.associations.order.1", 
                         "n.associations.order.2")
  results$perc.shared <- as.numeric(results$perc.shared)
  results$perc.shared.unique <- as.numeric(results$perc.shared.unique)

  results$mean.distance <- as.numeric(results$mean.distance)
  results$sd.distance <- as.numeric(results$sd.distance)
  return(results)
}
hostShiftingResultsFamily <- function(type="Bacteria", min.rep=0){
   metadata.type <- metadata[which(metadata$Type %in% type),]
  # Get host families, and keep only those with more than min.rep representatives
  type.host.families.t <- table(metadata.type$HostFamily)
  type.host.families <- names(type.host.families.t[type.host.families.t>min.rep])
  # Get list of host.species 
  host.table <- table(metadata.type$HostSpeciesPHB)
  host.species <- names(host.table)[which(host.table>min.rep)]
  results <- matrix(nrow=0, ncol=8)
  # For each host family
  for (reference.family in type.host.families){
    reference.family.pathogens <- unique(metadata.type[which(metadata.type$HostFamily==reference.family), "Species"])
    reference.family.species <- unique(as.character(metadata.type[which(metadata.type$HostFamily==reference.family), "HostSpeciesPHB"]))
    # For each host order, return percentage of pathogens shared with host family vs. mean distance 
    for (comparison.family in type.host.families){
      #  All pathogens in that host family
      pathogens <- metadata.type[which(metadata.type$HostFamily==comparison.family),"Species"]
      # Percentage of those pathogens that are shared with reference order
      perc.shared <- length(pathogens[pathogens %in% reference.family.pathogens])/length(pathogens)
      perc.shared.unique <- length(unique(pathogens)[unique(pathogens) %in% reference.order.pathogens])/length(unique(c(pathogens, reference.order.pathogens)))
      # Mean of all host order in order from reference order
      comparison.family.species <- unique(as.character(metadata.type[which(metadata.type$HostFamily==comparison.family),"HostSpeciesPHB"]))
      mean.distance <- mean(d[comparison.family.species, reference.family.species])
      sd.distance <- sd(d[comparison.family.species, reference.family.species])
      # Add to results
      results <- rbind(results, c(reference.family,
                                  comparison.family,
                                  mean.distance, 
                                  sd.distance, 
                                  perc.shared, 
                                  perc.shared.unique, 
                                  length(metadata.type[which(metadata.type$HostFamily==reference.family), "HostSpeciesPHB"]),
                                  length(metadata.type[which(metadata.type$HostFamily==comparison.family), "HostSpeciesPHB"]) ))
    }
  }
  
  # Naming of data frame, add names of host orders
  results <- as.data.frame(results, stringsAsFactors=FALSE)
  colnames(results) <- c("family.1",
                         "family.2",
                         "mean.distance", 
                         "sd.distance", 
                         "perc.shared", 
                         "perc.shared.unique",
                         "n.associations.family.1", 
                         "n.associations.order.2")
  results$perc.shared <- as.numeric(results$perc.shared)
  results$perc.shared.unique <- as.numeric(results$perc.shared.unique)

  results$mean.distance <- as.numeric(results$mean.distance)
  results$sd.distance <- as.numeric(results$sd.distance)
  return(results)
}

hostShiftingResultsSpecies <- function(type="Bacteria", min.rep=0){
   metadata.type <- metadata[which(metadata$Type %in% type),]
  # Get host orders, and keep only those with more than min.rep representatives
  type.host.species.t <- table(metadata.type$HostSpeciesPHB)
  type.host.species <- names(type.host.species.t[type.host.species.t>min.rep])
  # Get list of host.species 
  host.species <- type.host.species
  results <- matrix(nrow=0, ncol=8)
  # For each host species
  for (reference.species in host.species){
    reference.species.pathogens <- unique(metadata.type[which(metadata.type$HostSpeciesPHB==reference.species), "Species"])
    # For each host species, return percentage of pathogens shared with host species vs. mean distance 
    for (comparison.species in host.species){
      #  All pathogens in that host species
      pathogens <- metadata.type[which(metadata.type$HostSpeciesPHB==comparison.species),"Species"]
      # Percentage of those pathogens that are shared with reference species
      perc.shared <- length(pathogens[pathogens %in% reference.species.pathogens])/length(pathogens)
      perc.shared.unique <- length(unique(pathogens)[unique(pathogens) %in% reference.species.pathogens])/length(unique(c(pathogens, reference.species.pathogens)))
      # Mean of all host species in order from reference order
      mean.distance <- mean(d[comparison.species, reference.species])
      sd.distance <- sd(d[comparison.species, reference.species])
      # Add to results
      results <- rbind(results, c(reference.species,
                                  comparison.species,
                                  mean.distance, 
                                  sd.distance, 
                                  perc.shared, 
                                  perc.shared.unique, 
                                  length(reference.species), 
                                  length(comparison.species) ))
    }
  }
  
  # Naming of data frame, add names of host orders
  results <- as.data.frame(results, stringsAsFactors=FALSE)
  colnames(results) <- c("species.1",
                         "species.2",
                         "mean.distance", 
                         "sd.distance", 
                         "perc.shared", 
                         "perc.shared.unique",
                         "n.hosts.species.1", 
                         "n.hosts.species.2")
  results$perc.shared <- as.numeric(results$perc.shared)
  results$perc.shared.unique <- as.numeric(results$perc.shared.unique)

  results$mean.distance <- as.numeric(results$mean.distance)
  results$sd.distance <- as.numeric(results$sd.distance)
  return(results)
}

hostShiftingPlot <- function(results, title.string=""){
  # Correlation test
  correlation <- cor.test(results$mean.distance, results$perc.shared, method = "spearman")
  # Label for correlation statistics
  corr.label <- data.frame(
    mean.distance = 0,
    perc.shared = 0.1, 
    label = paste("Spearman's rho = ", myround(correlation$estimate, 3), 
                  "\n p = ", myround(correlation$p.value, 3)),
    host.order="",
    n.hosts="NA*")
  # Plot results
  p <- ggplot(results, aes(x=mean.distance, y=perc.shared.unique))+
    geom_point( fill="black", alpha=0.2)+
    stat_smooth(method="loess", aes(group=1), se = FALSE, size=2, alpha=0.5)+
    xlim(c(0,max(results$mean.distance)))+
    ylim(c(0,1))+
    xlab(paste("Mean phylogenetic distance"))+
    ylab(paste("Fraction of shared pathogens"))+
    theme_basic()+
    labs(size="Number of host-pathogen associations")+
    theme(axis.text=element_text(size=16),
          legend.text =element_text(colour="black", size=16),
          legend.title=element_text(colour="black", size=18),
          strip.text.x=element_text(colour="black", size=22),
          axis.title=element_text(colour="black", size=22))
  return(p)
}
metadata <- pathogen_vs_host_db
d <- cophenetic_matrix

results.orders.10 <- hostShiftingResultsOrder(type=c("Bacteria", "Virus"), min.rep=10)
p.pathogen.sharing.orders.10 <- hostShiftingPlot(results.orders.10)+xlab("Mean phylogenetic distance between orders")

results.orders.5 <- hostShiftingResultsOrder(type=c("Bacteria", "Virus"), min.rep=5)
p.pathogen.sharing.orders.5 <- hostShiftingPlot(results.orders.5)

results.species.10 <- hostShiftingResultsSpecies(type=c("Bacteria", "Virus"), min.rep=10)
# plot, but exclude comparisons of species with themselves, and the outliers
p.pathogen.sharing.species.10 <- hostShiftingPlot(results.species.10[which(results.species.10$mean.distance!=0 &
                                                                             results.species.10$mean.distance<1.8),])

We also carry out a Mantel test for correlation of the two distance matrices.

mean.dist <- acast(species.1 ~ species.2, value.var = "mean.distance", data=results.species.10)
perc.shared <- acast(species.1 ~ species.2, value.var = "perc.shared.unique", data=results.species.10)
mantel.test(mean.dist, perc.shared)
## $z.stat
## [1] 303.546
## 
## $p
## [1] 0.001
## 
## $alternative
## [1] "two.sided"

We construct the right panel of the figure. We make a separate data frame for bacteria and for viruses, then combine these together and plot the results.

host.orders <- table(metadata$HostOrder)
min.rep.10 <- names(host.orders)[host.orders>10]
hostShiftingPlotDataHuman <- function(type=c("Bacteria", "Virus"), reference.species="Homosapiens", host.orders = min.rep.10, title.string=""){
  # Limit to only pathogens of interest (bacteria, viruses, or both)
  metadata.type <- metadata[which(metadata$Type %in% type),]
  # Get host orders, and keep only those with more than min.rep representatives
  type.host.orders <- min.rep.10
  # Get list of reference.species pathogens
  reference.species.pathogens <- unique(metadata.type[which(metadata.type$HostSpeciesPHB==reference.species), "Species"])
  # For each host order, return percentage of pathogens shared with humans vs. mean distance of host species from reference species
  results <- matrix(nrow=0, ncol=3)
  for (b in type.host.orders){
    #  All pathogens in that host order, excluding the reference species
    pathogens <- unique(metadata.type[which(metadata.type$HostOrder==b & metadata.type$HostSpeciesPHB!=reference.species),"Species"])
    # Percentage of those pathogens that are shared with humans
    perc.shared <- length(pathogens[pathogens %in% reference.species.pathogens])/length(pathogens)
    # Mean of all host species in order from reference species. Note that we consider all species, 
    # not just those that have a pathogen association of the right type
    host.species <- as.character(metadata[which(metadata$HostOrder==b),"HostSpeciesPHB"])
    mean.distance <- mean(d[host.species, reference.species])
    # Add to results
    results <- rbind(results, c(perc.shared, mean.distance, length(unique(host.species))))
  }
  # Naming of data frame, add names of host orders
  results <- as.data.frame(results)
  colnames(results) <- c("perc.shared", "mean.distance", "n.hosts")
  results$host.order <- type.host.orders
  # Correlation test
  correlation <- cor.test(results$mean.distance, results$perc.shared, method = "spearman")
  # Label for correlation statistics
  corr.label <- data.frame(
    mean.distance = 0,
    perc.shared = 0.1, 
    label = paste("Spearman's rho = ", myround(correlation$estimate, 3), 
                  "\n p = ", myround(correlation$p.value, 3)),
    host.order="",
    n.hosts="NA*")
  # Return results
  return(results)
}

human.sharing.bacteria <- hostShiftingPlotDataHuman("Bacteria")
human.sharing.bacteria$type <- "Bacteria"
human.sharing.virus <- hostShiftingPlotDataHuman("Virus")
human.sharing.virus$type <- "Virus"
plot.df <- rbind(human.sharing.bacteria, human.sharing.virus)

# The parameter starting values for the sigmoidal fit are from fitting the sigmoid the bacteria data only with nls:
# fit <- nls(perc.shared ~ SSlogis(mean.distance, Asym, xmid, scal), data = human.sharing.bacteria)
p.human.sharing <- ggplot(plot.df, aes(x=mean.distance, y=perc.shared))+
  geom_point(aes(size=n.hosts, fill=type), colour="black", pch=21, alpha=0.5)+
  xlim(c(0,max(plot.df$mean.distance)))+
  geom_smooth(method="nls", aes(group=type, colour=type),size=2,   formula= y ~ Asym/(1+exp((xmid-x)/scal)), method.args = list(start=c(Asym=0.7, xmid=1.31, scal=-0.33)), se=FALSE)+
  scale_fill_manual(values=c("black", "red"))+
  scale_colour_manual(values=c("black", "red"))+
  scale_size_continuous(range = c(2,10))+
  ylim(c(0,1))+
  labs(size="Number of\nhost-pathogen\nassociations", fill="Pathogen type", colour="Pathogen type")+
  xlab(paste("Mean phylogenetic distance to Homo sapiens"))+
  ylab(paste("Fraction of shared pathogens with Homo sapiens"))+
  theme_basic()+
  theme(axis.text=element_text(size=16),
        legend.text =element_text(colour="black", size=16),
        legend.title=element_text(colour="black", size=18),
        strip.text.x=element_text(colour="black", size=22),
        axis.title=element_text(colour="black", size=22))

We combine the left and right panels together to make a preliminary version of the figure. The paper version was produced manually from this preliminary version in Inkscape, using animal silhouettes from phylopic.org.

pdf("figures/Figure-6-pathogen-sharing.pdf", width=17, height=10)
cowplot::plot_grid(p.pathogen.sharing.orders.10+ggtitle("(a)"), p.human.sharing+ggtitle("(b)"))
dev.off()
## quartz_off_screen 
##                 2
png("figures/Figure-6-pathogen-sharing.png", width=1700, height=1000, pointsize = 14)
cowplot::plot_grid(p.pathogen.sharing.orders.10, p.human.sharing)
dev.off()
## quartz_off_screen 
##                 2
# Show plot
cowplot::plot_grid(p.pathogen.sharing.orders.10, p.human.sharing)

It is useful to be able to match the points and names together, which can be done with the plot below.

ggplot(plot.df, aes(x=mean.distance, y=perc.shared))+
  geom_text(aes(label=host.order, colour=type), nudge_x = 0,nudge_y=0, size=1, angle=45)+
  stat_smooth(method="loess", aes(group=type, colour=type),size=2,  se = FALSE, formula = y ~ exp(x)/(exp(x)+1))+
  xlim(c(0,max(plot.df$mean.distance)))+
  scale_fill_manual(values=c("black", "red"))+
  scale_colour_manual(values=c("black", "red"))+
  scale_size_continuous(range = c(2,10))+
  ylim(c(0,1))+
  labs(size="Number of\nhost-pathogen\nassociations", fill="Pathogen type", colour="Pathogen type")+
  xlab(paste("Mean phylogenetic distance to Homo sapiens"))+
  ylab(paste("Fraction of shared pathogens with Homo sapiens"))+
  theme_basic()+geom_point(aes(colour=type))+facet_wrap(~type)

ggsave("figures/Figure-6-pathogen-sharing-right-panel-names.pdf")
## Saving 12 x 8 in image

Supplementary Figures

Supplementary Figure 1

We compare our mitochondrial gene phylogeny with the cytb phylogeny of Olival et al.

olival.tree <- read.tree('data/olival_cytb_supertree.tree')
# Convert tip labels
olival.tree$tip.label <- gsub("_", "", olival.tree$tip.label)
# Only consider shared species
olival.tree.shared <- drop.tip(olival.tree, 
                                              tip=olival.tree$tip.label[which(!olival.tree$tip.label %in% host_tree$tip.label)])
olival.tree.cophenetic <- cophenetic(olival.tree.shared)
this.study.tree.cophenetic <- d[rownames(olival.tree.cophenetic), rownames(olival.tree.cophenetic)]
# Plot cophenetic distances
tree.df <- data.frame(cbind(as.vector(olival.tree.cophenetic), as.vector(this.study.tree.cophenetic)))
p.olival.cophenetic.comparison <- ggplot(tree.df, aes(X1, X2))+
  geom_point(size=0.25, alpha=0.1)+
  theme_basic()+
  xlab("Cophenetic distance (Olival et al. 2017)")+
  ylab("Cophenetic distance (this study)")+
  geom_abline(linetype='dashed')+
  stat_smooth(method="lm", colour="red")
p.olival.cophenetic.comparison

png("figures/Supplementary-Figure-1-Cophenetic-Tree-Comparison.png", width=800, height=1200)
p.olival.cophenetic.comparison
dev.off()
## quartz_off_screen 
##                 2

Supplementary Figure 2

Mean and maximum PHB are well-correlated for most pathogens.

p.mean.max.phb <- ggplot(unique_pathogens, aes(x=PHB, y=PHB.max))+geom_point(aes(size=N.hosts, colour=Type))+theme_basic()+
  facet_wrap(~Type)+
  scale_color_manual(values=c("black", "red"))+
  geom_abline(slope=1, intercept=0)
p.mean.max.phb

pdf("figures/Supplementary-Figure-2-PHB-correlation.pdf", width=12, height=8)
p.mean.max.phb
dev.off()
## quartz_off_screen 
##                 2

Supplementary Figure 3

Shows how PHB varies within bacterial families.

# Family
min.abundance <- 20
df.generalist.prop.family <- data.frame(bacteria.df %>% group_by(family) %>% summarise(prop.generalist=length(n.hosts[which(n.hosts!=1)])/length(n.hosts), species=length(n.hosts)))
df.generalist.prop.family$family <- ordered(df.generalist.prop.family$family, levels=df.generalist.prop.family$family[order(df.generalist.prop.family$prop.generalist, decreasing=TRUE)])

p.bacteria.family.generalist <- ggplot(df.generalist.prop.family[which(df.generalist.prop.family$species>min.abundance & df.generalist.prop.family$prop.generalist!=0),], aes(family, prop.generalist, size=species))+
  geom_point()+
  theme_basic()+
  xlab("")+
  ylab("Proportion of generalists")+
  labs(size="Number\nof species")+
  ylim(c(0,1))+
  coord_flip()+
  theme(legend.position = "none")+
  ggtitle("(a) Proportion of generalists")+
    theme(plot.margin=unit(c(1,0.5,1,1), "cm"))+
  theme(legend.position = "none")


# Get these top families
top.families <- as.character(df.generalist.prop.family$family[which(df.generalist.prop.family$species>min.abundance)])
plot.df <- bacteria.df[which(bacteria.df$family %in% top.families & bacteria.df$bacteria.phbs!=0),]
shared.families <- unique(plot.df$family[which(plot.df$family %in% top.families)])
plot.df$family <- ordered(plot.df$family, levels=levels(df.generalist.prop.family$family))

p.bacteria.family.phb <- ggplot(plot.df, aes(family, bacteria.phbs))+
  geom_boxplot(outlier.shape=NA)+
  geom_jitter(width=0.05, height=0, pch=21, fill='white', colour='black')+
  theme_basic()+
  xlab("")+
  ylab("PHB")+
  ylim(c(0,1))+
  coord_flip()+
  theme(axis.text.y=element_blank())+
  ggtitle("(b) PHB of generalists")+
    theme(plot.margin=unit(c(1,1,1,-0.5), "cm"))

cowplot::plot_grid(p.bacteria.family.generalist, 
             p.bacteria.family.phb, 
             ncol=2)

pdf("figures/Supplementary-Figure-3-bacterial-families.pdf", width=12, height=8)
cowplot::plot_grid(p.bacteria.family.generalist, 
             p.bacteria.family.phb, 
             ncol=2)
dev.off()
## quartz_off_screen 
##                 2

Supplementary Figure 4

The same approach as Supplementary Figure 3, but now for viral families

# Family
min.abundance <- 20
df.generalist.prop.family <- data.frame(virus.df %>% group_by(family) %>% summarise(prop.generalist=length(n.hosts[which(n.hosts!=1)])/length(n.hosts), species=length(n.hosts)))
df.generalist.prop.family$family <- ordered(df.generalist.prop.family$family, levels=df.generalist.prop.family$family[order(df.generalist.prop.family$prop.generalist, decreasing=TRUE)])

p.virus.family.generalist <- ggplot(df.generalist.prop.family[which(df.generalist.prop.family$species>min.abundance & df.generalist.prop.family$prop.generalist>0.05),], aes(family, prop.generalist, size=species))+
  geom_point()+
  theme_basic()+
  xlab("")+
  ylab("Proportion of generalists")+
  labs(size="Number\nof species")+
  ylim(c(0,1))+
  coord_flip()+
  ggtitle("(a) Proportion of generalists")+
  theme(plot.margin=unit(c(1,0.5,1,1), "cm"))+
  theme(legend.position = "none")

# Get these top families
top.families <- as.character(df.generalist.prop.family$family[which(df.generalist.prop.family$species>min.abundance)])
plot.df <- virus.df[which(virus.df$family %in% top.families & virus.df$virus.phbs!=0),]
shared.families <- unique(plot.df$family[which(plot.df$family %in% top.families)])
plot.df$family <- ordered(plot.df$family, levels=levels(df.generalist.prop.family$family))

# family overlap
family.overlap <- unique(as.character(df.generalist.prop.family[which(df.generalist.prop.family$species>min.abundance & df.generalist.prop.family$prop.generalist!=0),"family"]))
p.virus.family.phb <- ggplot(plot.df[which(plot.df$family %in% family.overlap),], aes(family, virus.phbs))+
  geom_boxplot(outlier.shape=NA)+
  geom_jitter(width=0.05, height=0, pch=21, fill='white', colour='black')+
  theme_basic()+
  xlab("")+
  ylab("PHB")+
  coord_flip()+
  theme(axis.text.y=element_blank())+
  ggtitle("(b) PHB of generalists")+
  theme(plot.margin=unit(c(1,1,1,-0.5), "cm"))

cowplot::plot_grid(p.virus.family.generalist, 
             p.virus.family.phb, 
             ncol=2)

pdf("figures/Supplementary-Figure-4-viral-families.pdf", width=12, height=8)
cowplot::plot_grid(p.virus.family.generalist, 
             p.virus.family.phb, 
             ncol=2)
dev.off()
## quartz_off_screen 
##                 2

Supplementary Figure 5

How do genome GC content and size vary in specialist vs. generalist pathogens?

bacteria.unique <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Bacteria" & !duplicated(pathogen_vs_host_db$Species)),]
rownames(bacteria.unique) <- bacteria.unique$Species
bacteria.unique$PHB <- sapply(bacteria.unique$Species, function(x) PHB(x, m=pathogen_vs_host_db))
virus.unique <- pathogen_vs_host_db[which(pathogen_vs_host_db$Type=="Virus" & !duplicated(pathogen_vs_host_db$Species)),]
virus.unique$PHB <- sapply(virus.unique$Species, function(x) PHB(x, m=pathogen_vs_host_db))

bacteria.unique$Specialist <- ordered(ifelse(bacteria.unique$PHB>0, "Generalist", "Specialist"), 
                                      levels=c("Specialist", "Generalist"))
virus.unique$Specialist <- ordered(ifelse(virus.unique$PHB>0, "Generalist", "Specialist"), 
                                   levels=c("Specialist", "Generalist"))

p.bacteria.GC <- ggplot(bacteria.unique, aes(Specialist, Genome.GC, fill=Specialist))+
  geom_violin()+
  stat_summary(geom = "point",pch=1,  fun.y = "median", size=5)+
  stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
  theme_basic()+ 
  scale_fill_manual(values=c("#f0f0f0", "#636363"))+
  ylim(c(20, 80))+
  xlab("")+
  ylab("GC content (%)")+
  theme(legend.position="none")+
  ggtitle("Bacteria")+
  theme(plot.title=element_text(size=24, hjust=0.5))

p.virus.GC <- ggplot(virus.unique, aes(Specialist, Genome.GC, fill=Specialist))+
  geom_violin()+
  stat_summary(geom = "point",pch=1,  fun.y = "median", size=5)+
  stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
  theme_basic()+ 
  scale_fill_manual(values=c("#fee0d2", "#de2d26"))+
  ylim(c(20, 80))+
  xlab("")+
  ylab("GC content (%)")+
  theme(legend.position="none")+
  ggtitle("Viruses")+
  theme(plot.title=element_text(size=24, hjust=0.5))


p.bacteria.size <- ggplot(bacteria.unique, aes(Specialist, Genome.size, fill=Specialist))+
  geom_violin()+
  stat_summary(geom = "point",pch=1,  fun.y = "median", size=5)+
  stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
  theme_basic()+ 
  scale_fill_manual(values=c("#f0f0f0", "#636363"))+
  scale_y_log10()+
  ylab("Log(genome size in Mb)")+
  xlab("")+
  ylab("Log(genome size in Mb)")+
  theme(legend.position="none")

p.virus.size <- ggplot(virus.unique, aes(Specialist, Genome.size, fill=Specialist))+
  geom_violin()+
  stat_summary(geom = "point",pch=1,  fun.y = "median", size=5)+
  stat_summary(geom = "point", pch=2, fun.y = "mean", size=5)+
  theme_basic()+ 
  scale_y_log10()+
  scale_fill_manual(values=c("#fee0d2", "#de2d26"))+
  xlab("")+
  ylab("Log(genome size in Mb)")+
  theme(legend.position="none")

cowplot::plot_grid(p.bacteria.GC, p.virus.GC, p.bacteria.size, p.virus.size, ncol=2, widths=c(1,1))

# Save figure
pdf("figures/Supplementary-Figure-5-GC-content-genome-size.pdf", width=8, height=8)
cowplot::plot_grid(p.bacteria.GC, p.virus.GC, p.bacteria.size, p.virus.size, ncol=2, widths=c(1,1))
dev.off()
## quartz_off_screen 
##                 2

Effect of genome size of bacteria.

bacteria.specialist.genome.sizes <- bacteria.unique$Genome.size[which(bacteria.unique$PHB==0)]
bacteria.generalist.genome.sizes <- bacteria.unique$Genome.size[which(bacteria.unique$PHB!=0)]
# Missing data - number of associations kept
nrow(pathogen_vs_host_db[which(pathogen_vs_host_db$Species %in% rownames(bacteria.unique)[which(!is.na(bacteria.unique$Genome.size) & !is.na(bacteria.unique$Genome.GC))]),])
## [1] 4619
table(is.na(bacteria.specialist.genome.sizes))
## 
## FALSE  TRUE 
##   827   466
table(is.na(bacteria.generalist.genome.sizes))
## 
## FALSE  TRUE 
##   308    84
wilcox.test(x = na.omit(bacteria.specialist.genome.sizes), 
            y = na.omit(bacteria.generalist.genome.sizes))
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  na.omit(bacteria.specialist.genome.sizes) and na.omit(bacteria.generalist.genome.sizes)
## W = 140606, p-value = 0.00698
## alternative hypothesis: true location shift is not equal to 0

Supplementary Figure 6

Plot of virus PHB for different genome types (RNA/DNA). We use the Baltimore classification (type of genome and method of classification).

options(warn=-1) # Turn off warnings
# Virus genome type
virus.df$genome.type <- ordered(virus.df$genome.type, levels=c("ssRNA-RT", "(-) ssRNA", "(+) ssRNA", "dsRNA", "dsDNA-RT", "dsDNA"))
p.virus<- ggplot(virus.df[!is.na(virus.df$genome.type),], aes(genome.type, virus.phbs, colour=genome.type.rna.dna))+
  geom_boxplot(outlier.shape=NA)+geom_jitter(height=0, width=0.25, alpha=0.8)+
  theme_basic()+
  xlab("")+
  ylab("Mean PHB")+
  scale_color_manual(values=c("#377eb8", "#4daf4a"))+
  labs(colour="Genomic material")+
  ylim(c(0,1))+
  theme(axis.text=element_text(size=16),
        legend.text =element_text(colour="black", size=16),
        legend.title=element_text(colour="black", size=18),
        strip.text.x=element_text(colour="black", size=22),
        axis.title=element_text(colour="black", size=22))
p.virus

# Save figure
pdf("figures/Supplementary-Figure-6-virus-genome-type.pdf", width=12, height=8)
p.virus
dev.off()
## quartz_off_screen 
##                 2

Supplementary Figure 7

We plot the effect of bacterial lifestyle on the proportion of specialists vs. generalists.

bacteria.df$gram <- as.character(bacteria.df$gram)
bacteria.df$gram[which(is.na(bacteria.df$gram) | bacteria.df$gram %in% c("na", ""))] <- "Unknown"
# Manually amend lactobacillus spp and 
bacteria.df[which(bacteria.df$bacteria.names=="Lactobacillus spp"),"gram"] <- "Gram-positive"
bacteria.df[which(bacteria.df$bacteria.names=="Segniliparus rugosus"),"gram"] <- "Gram-positive"

gram.plot.df <- data.frame(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)) * 100)
gram.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)), 
                                   Var1=rownames(table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)),
                                   Freq=3)
gram.plot <- ggplot(gram.plot.df[which(!gram.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
  geom_bar(position="stack", stat="identity", colour="black")+
  theme_basic()+
  scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
  guides(fill=FALSE)+
  xlab("")+
  ylab("Within-category percentage (%)")+
  theme(legend.text=element_text(size=14))+
  ggtitle("Gram stain")+
  theme(plot.title=element_text(size=25, hjust=0.5))+
  geom_label(data=gram.plot.N.total.df[which(!gram.plot.df$Var1 %in% c("", "na")),], aes(label=N), fill=NA, size=6)


bacteria.df$motility <- as.character(bacteria.df$motility)
bacteria.df$motility[which(is.na(bacteria.df$motility) | bacteria.df$motility %in% c("na", ""))] <- "Unknown"
motile.plot.df <- data.frame(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0))*100)
motile.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)), 
                                     Var1=rownames(table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)),
                                     Freq=5)
motile.plot <- ggplot(motile.plot.df[which(!motile.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
  geom_bar(position="stack", stat="identity", colour="black")+
  theme_basic()+
  scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
  guides(fill=FALSE)+
  xlab("")+
  ylab("Within-category percentage (%)")+
  theme(legend.text=element_text(size=14))+
  ggtitle("Motility")+
  theme(plot.title=element_text(size=25, hjust=0.5))+
  geom_label(data=motile.plot.N.total.df[which(!motile.plot.df$Var1 %in% c("", "na")),], aes(label=N), fill=NA, size=6)

bacteria.df$cell <- as.character(bacteria.df$cell)
bacteria.df$cell[which(is.na(bacteria.df$cell))] <- "Unknown"
cell.plot.df <- data.frame(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)) * 100)
cell.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)), 
                                   Var1=rownames(table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)),
                                   Freq=5)
cell.plot <- ggplot(cell.plot.df[which(!cell.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
  geom_bar(position="stack", stat="identity", colour="black")+
  theme_basic()+
  scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
  guides(fill=FALSE)+
  xlab("")+
  ylab("Within-category percentage (%)")+
  theme(legend.text=element_text(size=14))+
  ggtitle("Cellular lifestyle")+
  theme(plot.title=element_text(size=25, hjust=0.5))+
  geom_label(data=cell.plot.N.total.df, aes(label=N), fill=NA, size=6)

bacteria.df$oxygen <- as.character(bacteria.df$oxygen)
# Manually correct Lactobacilli
bacteria.df$oxygen[which(bacteria.df$bacteria.names=="Lactobacillus spp")] <- "Facultatively anaerobic"
oxygen.plot.df <- data.frame(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)) * 100)
oxygen.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)), 
                                     Var1=rownames(table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)),
                                     Freq=5)
oxygen.plot <- ggplot(oxygen.plot.df[which(!oxygen.plot.df$Var1 %in% c( "Capnophilic")),], aes(Var1, Freq, fill=Var2))+
  geom_bar(position="stack", stat="identity", colour="black")+
  theme_basic()+
  scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
  guides(fill=FALSE)+
  xlab("")+
  ylab("Within-category percentage (%)")+
  theme(legend.text=element_text(size=14))+
  ggtitle("Oxygen requirements")+
  theme(plot.title=element_text(size=25, hjust=0.5))+
  geom_label(data=oxygen.plot.N.total.df[which(!oxygen.plot.N.total.df$Var1 %in% c( "Capnophilic")),], aes(label=N), fill=NA, size=6)

bacteria.df$spore <- as.character(bacteria.df$spore)
bacteria.df$spore[which(bacteria.df$spore %in% c("na", "") | is.na(bacteria.df$spore))] <- "Unknown"
bacteria.df$spore[which(bacteria.df$spore=="Yes")] <- "Spore-forming"
bacteria.df$spore[which(bacteria.df$spore=="No")] <- "Non-spore-forming"

spore.plot.df <- data.frame(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)/rowSums(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)) * 100)
spore.plot.N.total.df <- data.frame(N=rowSums(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)), 
                                    Var1=rownames(table(bacteria.df$spore, bacteria.df$bacteria.phbs>0)),
                                    Freq=5)
spore.plot <- ggplot(spore.plot.df[which(!spore.plot.df$Var1 %in% c("", "na")),], aes(Var1, Freq, fill=Var2))+
  geom_bar(position="stack", stat="identity", colour="black")+
  theme_basic()+
  scale_fill_manual(values=c("#fee5d9", "#fcae91"), labels=c("Specialist", "Generalist"))+
  guides(fill=FALSE)+
  xlab("")+
  ylab("Within-category percentage (%)")+
  theme(legend.text=element_text(size=14))+
  ggtitle("Spore formation")+
  theme(plot.title=element_text(size=25, hjust=0.5))+
  geom_label(data=spore.plot.N.total.df, aes(label=N), fill=NA, size=6)

cowplot::plot_grid(cell.plot, motile.plot, oxygen.plot, spore.plot, gram.plot, widths=c(4, 3), ncol=2)

# Save figure
pdf("figures/Supplementary-Figure-7-bacterial-lifestyle.pdf", width=14, height=18)
cowplot::plot_grid(cell.plot, motile.plot, oxygen.plot, spore.plot, gram.plot, widths=c(4, 3), ncol=2)
dev.off()
## quartz_off_screen 
##                 2

We now consider modelling the effect of motility and cellular behaviour together. No linear models show anything interesting.

# Motility
motility.table <- table(bacteria.df$motility, bacteria.df$bacteria.phbs>0)
motility.table <- motility.table[c("Motile", "Non-motile"),]
motility.table/rowSums(motility.table)
##             
##                  FALSE      TRUE
##   Motile     0.7276265 0.2723735
##   Non-motile 0.7832293 0.2167707
chisq.test(motility.table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  motility.table
## X-squared = 5.768, df = 1, p-value = 0.01632
require(knitr)
kable(motility.table)
FALSE TRUE
Motile 374 140
Non-motile 878 243
# Cellular
cellular.table <- table(bacteria.df$cell, bacteria.df$bacteria.phbs>0)
cellular.table <- cellular.table[as.character(unique(bacteria.df$cell))[1:3],]
cellular.table
##                            
##                             FALSE TRUE
##   Obligate intracellular       25   28
##   Facultative intracellular    48   45
##   Extracellular                82   79
chisq.test(cellular.table)
## 
##  Pearson's Chi-squared test
## 
## data:  cellular.table
## X-squared = 0.2932, df = 2, p-value = 0.8636
kable(cellular.table)
FALSE TRUE
Obligate intracellular 25 28
Facultative intracellular 48 45
Extracellular 82 79
# Oxygen: consider only aerobic and anerobic
oxygen.table <- table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)
oxygen.table <- oxygen.table[1:2,]
chisq.test(oxygen.table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  oxygen.table
## X-squared = 15.227, df = 1, p-value = 9.533e-05
kable(oxygen.table)
FALSE TRUE
Aerobic 513 135
Anaerobic 307 37
# And consider facultatively anaerobic
oxygen.table <- table(bacteria.df$oxygen, bacteria.df$bacteria.phbs>0)
oxygen.table <- oxygen.table[c("Aerobic", "Anaerobic", "Facultatively anaerobic"),]
chisq.test(oxygen.table)
## 
##  Pearson's Chi-squared test
## 
## data:  oxygen.table
## X-squared = 55.754, df = 2, p-value = 7.82e-13
# Gram
gram.table <- table(bacteria.df$gram, bacteria.df$bacteria.phbs>0)
gram.table <- gram.table[c("Gram-negative", "Gram-positive", "Gram-variable", "Lack cell wall"),]
kable(gram.table)
FALSE TRUE
Gram-negative 666 240
Gram-positive 572 134
Gram-variable 16 1
Lack cell wall 39 17

Tables

Tables in manuscript, including some additional tables.

Table 1

Table 1 is a review of previous similar studies from the literature, so has no associated code.

Table 2

Shows the niche and specificity of pathogens. According to our definitions, a specialist pathogen infects only a single host, a generalist more than one. Generalists are categorised according to whether their hosts are within the same family (e.g. Bovidae), order (e.g. Artiodactyla), or across orders. Percentages are shown in the manuscript but are not given here.

options(warn=-1)
# NICHE
unique.pathogen.names <- unique_pathogens$Species
unique.pathogen.types <- unique_pathogens$Type
unique_pathogens$N.non.human.hosts <- as.numeric(sapply(unique.pathogen.names, function(x) nrow(pathogen_vs_host_db_no_humans[which(pathogen_vs_host_db_no_humans$Species==x),])))
unique_pathogens$human.host <- as.numeric(sapply(unique.pathogen.names, function(x) nrow(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x & pathogen_vs_host_db$HostSpeciesPHB=="Homosapiens"),])))
unique_pathogens$niche <- ifelse(unique_pathogens$human.host==1 & unique_pathogens$N.non.human.hosts==0, 
                                 "Human-only",
                                 ifelse(unique_pathogens$human.host==0 & unique_pathogens$N.non.human.hosts!=0,
                                        "Animal-only",
                                        ifelse(unique_pathogens$human.host==1 & unique_pathogens$N.non.human.hosts!=0,
                                               "Zoonotic", "Not classified")))
kable(table(unique_pathogens$niche, unique_pathogens$Type))
Bacteria Virus
Animal-only 380 540
Human-only 855 133
Zoonotic 450 237
# SPECIFICITY
unique_pathogens$specificity <- ifelse(unique_pathogens$N.hosts>1,
                                       "Generalist",
                                       ifelse(unique_pathogens$N.non.human.hosts==1 & unique_pathogens$human.host==0,
                                              "Animal-only specialist",
                                              ifelse(unique_pathogens$N.non.human.hosts==0 & unique_pathogens$human.host==1,
                                                     "Human-only specialist", "Not classified")))
all_identical <- function(x) {
  if (length(x) == 1L) {
    warning("'x' has a length of only 1")
    return(TRUE)
  } else if (length(x) == 0L) {
    warning("'x' has a length of 0")
    return(logical(0))
  } else {
    TF <- vapply(1:(length(x)-1),
                 function(n) identical(x[[n]], x[[n+1]]),
                 logical(1))
    if (all(TF)) TRUE else FALSE
  }
}
within.family <- sapply(unique.pathogen.names, function(x) all_identical(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x),"HostFamily"]))
within.order <- sapply(unique.pathogen.names, function(x) all_identical(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x),"HostOrder"]))
within.group <- sapply(unique.pathogen.names, function(x) all_identical(pathogen_vs_host_db[which(pathogen_vs_host_db$Species==x),"HostGroup"]))

unique_pathogens$generalist.narrow.broad <- ifelse(unique_pathogens$N.hosts==1, "Specialist", 
                                                   ifelse(within.family==TRUE, 
                                                          "Within-family", 
                                                          ifelse(within.order==TRUE, 
                                                                 "Within-order", 
                                                                 ifelse(within.family==FALSE & within.order==FALSE & unique_pathogens$N.hosts>1, "Generalist across orders", "Not classified"))))
# SPECIALIST
kable(table(unique_pathogens$specificity, unique_pathogens$Type))
Bacteria Virus
Animal-only specialist 231 254
Generalist 599 523
Human-only specialist 855 133
# GENERALIST NARROW-BROAD
kable(table(unique_pathogens$generalist.narrow.broad, unique_pathogens$Type))
Bacteria Virus
Generalist across orders 508 307
Specialist 1086 387
Within-family 49 132
Within-order 42 84

Table 3

This is a summary of the best-fit GAMs (see `intermediates/*rds’) for host traits which predict viral/bacterial richness (Figure 4) and pathogen traits which predict zoonotic potential (Figure 5).

Supplementary Table 1

Vector-borne pathogens are more likely to be generalists and have a higher median PHB. The below code contains the chi-squared tests mentioned in the manuscript. Supplementary Table 1 is produced from these results, removing those pathogens for which data is not available.

# Virus
table(virus.df$vector.borne)
## 
##   No  Yes Yes? 
##  740  167    3
wilcox.test(virus.df$virus.phbs[which(virus.df$vector.borne=="Yes")], virus.df$virus.phbs[which(virus.df$vector.borne=="No")])
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  virus.df$virus.phbs[which(virus.df$vector.borne == "Yes")] and virus.df$virus.phbs[which(virus.df$vector.borne == "No")]
## W = 90076, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# Vector-borne
quantile(virus.df$virus.phbs[which(virus.df$vector.borne=="Yes")])
##        0%       25%       50%       75%      100% 
## 0.0000000 0.0000000 0.4251265 0.5459129 0.8242654
# Not vector-borne
quantile(virus.df$virus.phbs[which(virus.df$vector.borne=="No")])
##        0%       25%       50%       75%      100% 
## 0.0000000 0.0000000 0.0000000 0.2748792 1.5404218
# Bacteria
table(bacteria.df$vector.borne)
## 
##        No  Yes Yes? 
##    1 1577  105    2
wilcox.test(bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="Yes")], bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="No")])
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne == "Yes")] and bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne == "No")]
## W = 105488, p-value = 1.828e-10
## alternative hypothesis: true location shift is not equal to 0
# Vector-borne
quantile(bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="Yes")])
##        0%       25%       50%       75%      100% 
## 0.0000000 0.0000000 0.0000000 0.5065082 0.7668872
# Not vector-borne
quantile(bacteria.df$bacteria.phbs[which(bacteria.df$vector.borne=="No")])
##        0%       25%       50%       75%      100% 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.9109023

Consider generalist/specialist proportion.

# Virus
virus.vector.table <- table(virus.df$vector.borne, virus.df$virus.phbs>0)
virus.vector.table <- virus.vector.table[c("No", "Yes"),]
chisq.test(virus.vector.table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  virus.vector.table
## X-squared = 60.878, df = 1, p-value = 6.073e-15
# Bacteria
bacteria.vector.table <- table(bacteria.df$vector.borne, bacteria.df$bacteria.phbs>0)
bacteria.vector.table <- bacteria.vector.table[c("No", "Yes"),]
chisq.test(bacteria.vector.table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bacteria.vector.table
## X-squared = 42.322, df = 1, p-value = 7.74e-11

Supplementary Table 2

Viruses with an RNA genome and larger genome size have a greater host range. Having an RNA genome and a larger genome were both significantly associated with greater mean PHB (p<0.001 for both variables) with a non-significant interaction between them (p=0.36).

# Interaction model
virus.df.model <- virus.df[which(virus.df$genome.size!=""),]

# Interaction
combined.model <- lm(virus.phbs ~ genome.type.rna.dna*genome.size, data=virus.df.model)
summary(combined.model)
## 
## Call:
## lm(formula = virus.phbs ~ genome.type.rna.dna * genome.size, 
##     data = virus.df.model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31203 -0.15312 -0.05045  0.15669  0.66023 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         0.04796    0.01529   3.136  0.00179 ** 
## genome.type.rna.dnaRNA              0.18795    0.03073   6.116 1.65e-09 ***
## genome.size                         0.66078    0.14753   4.479 8.85e-06 ***
## genome.type.rna.dnaRNA:genome.size  1.74152    1.91937   0.907  0.36456    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2055 on 655 degrees of freedom
## Multiple R-squared:  0.1771, Adjusted R-squared:  0.1733 
## F-statistic: 46.99 on 3 and 655 DF,  p-value: < 2.2e-16

The following table is not shown in the main text, but it is interesting to note that genome size is not significantly associated with greater PHB in a univariate model.

# Interaction
univariate.size.model <- lm(virus.phbs ~ genome.size, data=virus.df.model)
summary(univariate.size.model)
## 
## Call:
## lm(formula = virus.phbs ~ genome.size, data = virus.df.model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1901 -0.1897 -0.1763  0.1950  0.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.19026    0.01027  18.530   <2e-16 ***
## genome.size -0.07526    0.14621  -0.515    0.607    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2261 on 657 degrees of freedom
## Multiple R-squared:  0.000403,   Adjusted R-squared:  -0.001118 
## F-statistic: 0.2649 on 1 and 657 DF,  p-value: 0.6069

Supplementary Table 3

Bacterial motility and cellular lifestyle are not associated with greater host range. Combining motility and cellular proliferation in a linear model with suggests that neither variable is associated with greater mean PHB.

bacteria.df.model <- bacteria.df[which(bacteria.df$cell %in% c("Extracellular", "Facultative intracellular", "Obligate intracellular") & 
                                         bacteria.df$motility %in% c("Motile", "Non-motile")),]
model <- lm(bacteria.phbs ~ cell + motility, data=bacteria.df.model)
summary(model)
## 
## Call:
## lm(formula = bacteria.phbs ~ cell + motility, data = bacteria.df.model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2886 -0.2635 -0.1811  0.2568  0.6223 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.26354    0.02840   9.278   <2e-16 ***
## cellFacultative intracellular -0.04182    0.03845  -1.088    0.278    
## cellObligate intracellular    -0.00428    0.05110  -0.084    0.933    
## motilityNon-motile             0.02502    0.03726   0.672    0.502    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2929 on 299 degrees of freedom
## Multiple R-squared:  0.006106,   Adjusted R-squared:  -0.003866 
## F-statistic: 0.6124 on 3 and 299 DF,  p-value: 0.6075

Again, univariate linear models and a linear model with an interaction term give the same conclusion.

Cell lifestyle

cell.model <- lm(bacteria.phbs ~ cell, data=bacteria.df.model)
summary(cell.model)
## 
## Call:
## lm(formula = bacteria.phbs ~ cell, data = bacteria.df.model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2843 -0.2744 -0.1669  0.2500  0.6365 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.274376   0.023353  11.749   <2e-16 ***
## cellFacultative intracellular -0.039739   0.038288  -1.038    0.300    
## cellObligate intracellular     0.009906   0.046484   0.213    0.831    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2926 on 300 degrees of freedom
## Multiple R-squared:  0.004607,   Adjusted R-squared:  -0.002029 
## F-statistic: 0.6943 on 2 and 300 DF,  p-value: 0.5002

Motility

motility.model <- lm(bacteria.phbs ~ motility, data=bacteria.df.model)
summary(motility.model)
## 
## Call:
## lm(formula = bacteria.phbs ~ motility, data = bacteria.df.model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2753 -0.2753 -0.1907  0.2551  0.6356 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.24949    0.02527   9.873   <2e-16 ***
## motilityNon-motile  0.02585    0.03384   0.764    0.445    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2925 on 301 degrees of freedom
## Multiple R-squared:  0.001935,   Adjusted R-squared:  -0.001381 
## F-statistic: 0.5836 on 1 and 301 DF,  p-value: 0.4455

Combined model with interaction term

interaction.model <- lm(bacteria.phbs ~ cell*motility, data=bacteria.df.model)
summary(interaction.model)
## 
## Call:
## lm(formula = bacteria.phbs ~ cell * motility, data = bacteria.df.model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2937 -0.2596 -0.1862  0.2530  0.6172 
## 
## Coefficients: (1 not defined because of singularities)
##                                                   Estimate Std. Error t value
## (Intercept)                                       0.259634   0.031092   8.351
## cellFacultative intracellular                    -0.030196   0.053653  -0.563
## cellObligate intracellular                       -0.009388   0.053745  -0.175
## motilityNon-motile                                0.034035   0.047243   0.720
## cellFacultative intracellular:motilityNon-motile -0.023963   0.077047  -0.311
## cellObligate intracellular:motilityNon-motile           NA         NA      NA
##                                                  Pr(>|t|)    
## (Intercept)                                      2.59e-15 ***
## cellFacultative intracellular                       0.574    
## cellObligate intracellular                          0.861    
## motilityNon-motile                                  0.472    
## cellFacultative intracellular:motilityNon-motile    0.756    
## cellObligate intracellular:motilityNon-motile          NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2933 on 298 degrees of freedom
## Multiple R-squared:  0.006429,   Adjusted R-squared:  -0.006908 
## F-statistic: 0.4821 on 4 and 298 DF,  p-value: 0.7489